Musings on Deep Learning: Properties of SGD

Authors

  • Chiyuan Zhang
  • Qianli Liao
  • Alexander Rakhlin
  • Brando Miranda
  • Noah Golowich
  • Tomaso Poggio
Abstract

We ruminate with a mix of theory and experiments on the optimization and generalization properties of deep convolutional networks trained with Stochastic Gradient Descent in classification tasks. A presently perceived puzzle is that deep networks show good predictive performance when overparametrization relative to the number of training data suggests overfitting. We dream an explanation of these empirical results in terms of the following new results on SGD:

1. SGD concentrates in probability, like the classical Langevin equation, on large-volume, "flat" minima, selecting flat minimizers which are with very high probability also global minimizers.
2. Minimization by GD or SGD on flat minima can be approximated well by minimization of a linear function of the weights, suggesting pseudoinverse solutions.
3. Pseudoinverse solutions are known to be intrinsically regularized, with a regularization parameter λ which decreases as 1/T, where T is the number of iterations. This can qualitatively explain all the generalization properties empirically observed for deep networks.
4. GD and SGD are closely connected to robust optimization. This provides an alternative way to show that GD and SGD perform implicit regularization.

These results explain the puzzling findings about fitting randomly labeled data while performing well on naturally labeled data. They also explain why overparametrization does not result in overfitting. Quantitative, non-vacuous bounds are still missing, as has almost always been the case for most practical applications of machine learning. This is version 3, which differs from version 2 only in the title. The first version was released on 04/04/2017 at https://dspace.mit.edu/handle/1721.1/107841.
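Point 3 above can be illustrated on the simplest possible case, linear least squares, where early-stopped gradient descent behaves like ridge regression with an effective regularization parameter that shrinks roughly as 1/(ηT). The sketch below is a toy numerical check of that correspondence (plain NumPy; the problem sizes, learning rate, and the exact choice λ = 1/(ηT) are illustrative assumptions, not taken from the paper): GD run for T steps from zero roughly tracks the ridge solution with λ = 1/(ηT), and both approach the minimum-norm pseudoinverse solution as T grows.

```python
# Toy check (not from the paper): T steps of gradient descent on linear least
# squares vs. (i) the ridge solution with lambda ~ 1/(eta*T) and (ii) the
# minimum-norm pseudoinverse solution. Sizes, step size, and seed are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200                                # overparametrized: d >> n
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

eta = 1.0 / np.linalg.norm(X, 2) ** 2         # step size below 1/L, L = sigma_max^2
w_pinv = np.linalg.pinv(X) @ y                # minimum-norm (pseudoinverse) solution

def run_gd(T):
    """T steps of full-batch gradient descent on 0.5*||Xw - y||^2, from w = 0."""
    w = np.zeros(d)
    for _ in range(T):
        w -= eta * X.T @ (X @ w - y)
    return w

def ridge(lam):
    """Ridge solution (X^T X + lam*I)^(-1) X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

for T in [10, 100, 1000, 10000]:
    w_gd = run_gd(T)
    w_ridge = ridge(1.0 / (eta * T))          # effective lambda decreasing as 1/T
    print(f"T={T:6d}  ||gd - ridge||={np.linalg.norm(w_gd - w_ridge):7.4f}  "
          f"||gd - pinv||={np.linalg.norm(w_gd - w_pinv):7.4f}")
```

As T grows the effective λ goes to zero and the early-stopped iterate converges to the pseudoinverse solution, which is the sense in which the number of iterations itself plays the role of a vanishing regularizer.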


Similar Articles

YellowFin and the Art of Momentum Tuning

Hyperparameter tuning is a major cost of deep learning. Momentum is a key hyperparameter of SGD and its variants, and adaptive methods such as Adam do not tune it. The YellowFin optimizer:

  • is based on the robustness properties of momentum;
  • auto-tunes momentum and learning rate in SGD;
  • provides closed-loop momentum control for asynchronous training.

In experiments on ResNet and LSTM models, YellowFin runs w...
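For reference, the two scalars YellowFin tunes are the learning rate and momentum coefficient of the standard heavy-ball SGD update. The sketch below shows that generic update only; it is not YellowFin's closed-loop tuning rule, and the toy loss is an arbitrary stand-in.

```python
# Generic heavy-ball (momentum) SGD update; lr and momentum are the two
# hyperparameters an auto-tuner such as YellowFin adjusts. This is NOT
# YellowFin's tuning rule, just the update it tunes.
import numpy as np

def momentum_sgd_step(w, v, grad, lr=0.01, momentum=0.9):
    """One heavy-ball step: v <- momentum*v - lr*grad, then w <- w + v."""
    v = momentum * v - lr * grad
    return w + v, v

# Toy usage on the quadratic loss 0.5*||w||^2, whose gradient is w itself.
w, v = np.ones(3), np.zeros(3)
for _ in range(200):
    w, v = momentum_sgd_step(w, v, grad=w)
print(w)   # close to the minimizer at zero
```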


Theory of Deep Learning III: Generalization Properties of SGD

In Theory III we characterize with a mix of theory and experiments the consistency and generalization properties of deep convolutional networks trained with Stochastic Gradient Descent in classification tasks. A presently perceived puzzle is that deep networks show good predictive performance when overparametrization relative to the number of training data suggests overfitting. We describe an exp...


"Oddball SGD": Novelty Driven Stochastic Gradient Descent for Training Deep Neural Networks

Stochastic Gradient Descent (SGD) is arguably the most popular of the machine learning methods applied to training deep neural networks (DNN) today. It has recently been demonstrated that SGD can be statistically biased so that certain elements of the training set are learned more rapidly than others. In this article, we place SGD into a feedback loop whereby the probability of selection is pro...
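The description above is cut off, but the mechanism it points at is a feedback loop in which the probability of selecting a training example depends on how "surprising" that example currently is to the network. The sketch below is a hypothetical illustration of that general idea, using the current per-example error of a linear model as a stand-in novelty measure; it is not the exact novelty score used by Oddball SGD.

```python
# Hypothetical sketch of novelty/error-driven example selection inside an SGD
# loop: examples with larger current error are sampled more often. This shows
# the general feedback-loop idea only; it is NOT Oddball SGD's actual measure.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

w, lr = np.zeros(d), 0.01
for step in range(5000):
    errors = np.abs(X @ w - y) + 1e-8        # per-example "novelty" proxy
    probs = errors / errors.sum()            # selection probability grows with error
    i = rng.choice(n, p=probs)               # feedback loop: favour hard examples
    w -= lr * (X[i] @ w - y[i]) * X[i]       # plain SGD step on the chosen example
print(np.mean((X @ w - y) ** 2))             # final mean squared training error
```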


How to scale distributed deep learning?

Training time on large datasets for deep neural networks is the principal workflow bottleneck in a number of important applications of deep learning, such as object classification and detection in advanced driver assistance systems (ADAS). To minimize training time, the training of a deep neural network must be scaled beyond a single machine to as many machines as possible by distributing the ...
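The teaser is cut off, but the standard pattern for scaling training "beyond a single machine" is data parallelism: each worker computes a gradient on its own shard of the data and the gradients are averaged (an all-reduce) before a shared weight update. The sketch below simulates that pattern with several workers in one NumPy process, with no real networking; it is a generic illustration and not the specific scaling scheme studied in the cited paper.

```python
# Generic sketch of synchronous data-parallel SGD, simulated in one process:
# each "worker" owns a shard, computes a local gradient, and the averaged
# gradient (the all-reduce step) updates a shared copy of the weights.
import numpy as np

rng = np.random.default_rng(0)
n, d, n_workers = 400, 20, 4
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)

shards = list(zip(np.array_split(X, n_workers), np.array_split(y, n_workers)))
w, lr = np.zeros(d), 0.05

for step in range(500):
    grads = [Xs.T @ (Xs @ w - ys) / len(ys) for Xs, ys in shards]  # local gradients
    w -= lr * np.mean(grads, axis=0)                               # "all-reduce": average, then update
print(np.mean((X @ w - y) ** 2))   # training loss after the simulated distributed run
```

In a real deployment the averaging step is what the interconnect must carry every iteration, which is why communication cost, batch size, and worker count dominate the scaling discussion.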


Annealed Gradient Descent for Deep Learning

Stochastic gradient descent (SGD) has been regarded as a successful optimization algorithm in machine learning. In this paper, we propose a novel annealed gradient descent (AGD) method for non-convex optimization in deep learning. AGD optimizes a sequence of gradually improved smoother mosaic functions that approximate the original non-convex objective function according to an annealing schedul...
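The annealing idea sketched above, optimizing a sequence of progressively less-smoothed surrogates of a non-convex objective and warm-starting each stage from the last, can be illustrated with Gaussian smoothing of a toy 1-D function. The sketch below uses a sampled Gaussian-smoothing gradient as the surrogate; it is a generic continuation-method illustration, not AGD's mosaic-function construction.

```python
# Generic continuation/annealing sketch: minimize progressively less-smoothed
# versions of a non-convex 1-D objective, warm-starting each stage from the
# previous one. Gaussian smoothing stands in for the paper's mosaic functions.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2 + 2.0 * np.sin(5.0 * x)        # non-convex toy objective

def smoothed_grad(x, sigma, n_samples=512):
    """Gradient of E[f(x + sigma*z)], z ~ N(0,1), via the score-function identity."""
    z = rng.standard_normal(n_samples)
    return np.mean(f(x + sigma * z) * z) / sigma

x, lr = 3.0, 0.05
for sigma in [2.0, 1.0, 0.5, 0.25, 0.1]:            # annealing schedule: less smoothing each stage
    for _ in range(300):
        x -= lr * smoothed_grad(x, sigma)
    print(f"sigma={sigma:4.2f}  x={x:+.3f}  f(x)={f(x):+.3f}")
```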



Publication date: 2017